Abstract. It is well-known that, given a probability distribution over n characters, in the
worst case it takes $\Theta(n \log n)$ bits to store a prefix code with minimum expected
codeword length. However, in this paper we first show that, for any $0 < \epsilon < 1/2$ with
$1/\epsilon = O(\mathrm{polylog}(n))$, it takes $O(n \log\log(1/\epsilon))$ bits to store a prefix code with
expected codeword length within $\epsilon$ of the minimum. We then show that, for any constant
$c > 1$, it takes $O(n^{1/c} \log n)$ bits to store a prefix code with expected codeword length at
most c times the minimum. In both cases, our data structures allow us to encode and decode
any character in $O(1)$ time.

1 Introduction 

Compression is most important when space is in short supply, so popular compressors 
are usually heavily engineered to reduce their space usage. Theory has lagged behind 
practice in this area, however, and there remain basic open questions about the space 
needed for even the simplest kinds of compression. For example, while compression 
with prefix codes is familiar to any student of information theory, very little has 
been proven about compression of prefix codes. Suppose we are given a probability
distribution P over an alphabet of n characters. Until fairly recently, the only general
bounds known seem to have been, first, that it takes $\Theta(n \log n)$ bits in the worst case
to store a prefix code with minimum expected codeword length and, second, that it
takes $O(n)$ bits to store a prefix code with expected codeword length within 1 of the
minimum.

In 1998 Adler and Maggs [1] showed it generally takes more than $(9/40) n^{1/(20c)} \log n$
bits to store a prefix code with expected codeword length at most $cH(P)$, where $H(P)$
is P's entropy and a lower bound on the expected codeword length. (In this paper
we consider only binary codes, and by log we always mean $\log_2$.) In 2006 Gagie [5,
6] (see also [7]) showed that, for any constant $c \geq 1$, it takes $O(n^{1/c} \log n)$ bits to
store a prefix code with expected codeword length at most $cH(P) + 2$. He also showed
his upper bound is nearly optimal because, for any positive constant $\epsilon$, we cannot
always store a prefix code with expected codeword length at most $cH(P) + o(\log n)$
in $O(n^{1/c - \epsilon})$ bits. Gagie proved his upper bound by describing a data structure that
stores a prefix code with the prescribed expected codeword length in the prescribed
space and allows us to encode and decode any character in time at most proportional
to its codeword's length. This data structure has three obvious defects: when c = 1,
it is as big as a Huffman tree, whereas its redundancy guarantee can be obtained
with just $O(n)$ bits [9]; when $H(P)$ is small, a possible additive increase of 2 in the
expected codeword length may be prohibitive; and it is slower than the state of the
art.

In this paper we answer several open questions related to efficient representation of 
codes. First, in Section 3 we show that, for any $0 < \epsilon < 1/2$ with $1/\epsilon = O(\mathrm{polylog}(n))$,
it takes $O(n \log\log(1/\epsilon))$ bits to store a prefix code with expected codeword length
within $\epsilon$ of the minimum. Thus, if we can tolerate an additive increase of, say, 0.01
in the expected codeword length, then we can store a prefix code using only $O(1)$
bits per character. Second, in Section 4 we show that, for any constant $c > 1$, it
takes $O(n^{1/c} \log n)$ bits to store a prefix code with expected codeword length at most
c times the minimum, with no extra additive increase. Thus, if we can tolerate a
multiplicative increase of, say, 2.01 then we can store a prefix code in $O(\sqrt{n})$ bits. In
both cases, our data structures allow us to encode and decode any character in $O(1)$
time. 

2 Related work 

A simple pointer-based implementation of a Huffman tree takes $O(n \log n)$ bits, and
it is not difficult to show this is an optimal upper bound for storing a prefix code with
minimum expected codeword length. For example, suppose we are given a permutation
$\pi$ over n characters. Let P be the probability distribution that assigns probability
$1/2^i$ to the $\pi(i)$th character, for $1 \leq i < n$, and probability $1/2^{n-1}$ to the $\pi(n)$th
character. Since P is dyadic, every prefix code with minimum expected codeword length
assigns a codeword of length i to the $\pi(i)$th character, for $1 \leq i < n$, and a codeword
of length $n - 1$ to the $\pi(n)$th character. Therefore, given any prefix code with minimum
expected codeword length and a bit indicating whether $\pi(n-1) < \pi(n)$, we can find
$\pi$. Since there are $n!$ choices for $\pi$, in the worst case it takes $\Omega(\log n!) = \Omega(n \log n)$
bits to store a prefix code with minimum expected codeword length.
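
The argument above can be checked on a small instance. The following Python sketch is only an illustration: the function names are ours, characters are numbered from 0 rather than 1, and exact fractions are used so that ties between equal probabilities are handled without rounding. It builds the skewed distribution for a given permutation, computes Huffman codeword lengths, and recovers the permutation from the lengths plus one extra bit.

```python
import heapq
from fractions import Fraction

def skewed_distribution(pi):
    """pi[i-1] plays the role of pi(i); characters are numbered 0..n-1."""
    n = len(pi)
    probs = [None] * n
    for i in range(1, n):
        probs[pi[i - 1]] = Fraction(1, 2 ** i)
    probs[pi[n - 1]] = Fraction(1, 2 ** (n - 1))
    return probs

def huffman_lengths(probs):
    """Codeword length of each character in some Huffman code for probs."""
    heap = [(p, ch, [ch]) for ch, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)                      # unique tie-breaker for merged nodes
    while len(heap) > 1:
        p1, _, chars1 = heapq.heappop(heap)
        p2, _, chars2 = heapq.heappop(heap)
        for ch in chars1 + chars2:        # every leaf below the new node gets deeper
            lengths[ch] += 1
        heapq.heappush(heap, (p1 + p2, tie, chars1 + chars2))
        tie += 1
    return lengths

def recover_permutation(lengths, smaller_first):
    """Rebuild pi from the lengths and the bit telling whether pi(n-1) < pi(n)."""
    order = sorted(range(len(lengths)), key=lambda ch: (lengths[ch], ch))
    if not smaller_first:
        order[-2], order[-1] = order[-1], order[-2]
    return order

pi = [3, 0, 4, 1, 2]
lengths = huffman_lengths(skewed_distribution(pi))
recovered = recover_permutation(lengths, pi[-2] < pi[-1])
```

Here `recovered` equals `pi`, so the code lengths together with a single bit identify the permutation, which is what forces the $\Omega(n \log n)$-bit lower bound.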

Considering the argument above, it is natural to ask whether the same lower 
bound holds for probability distributions that are not so skewed, and the answer is 
no. A prefix code is canonical [18] if the first codeword is a string of 0s and any other
codeword can be obtained from its predecessor by adding 1, viewing its predecessor
as a binary number, and appending some number of 0s. (See, e.g., [15, 13] for more
recent work on canonical codes.) Given any prefix code, without changing the length
of the codeword assigned to any character, we can put the code into canonical form
by just exchanging left and right siblings in the code-tree. Moreover, we can reassign
the codewords such that, if a character is lexicographically the jth with a codeword of
length $\ell$, then it is assigned the jth consecutive codeword of length $\ell$. It is clear that it
is sufficient to store the codeword length of each character to be able to reconstruct
such a code, and thus the code can be represented in $O(n \log L)$ bits, where L is the
length of the longest codeword.
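
As a minimal sketch of why the lengths alone suffice, the Python function below (the name `canonical_code` and the 0-based character numbering are our own choices) rebuilds a canonical code from a list of codeword lengths, assuming the lengths satisfy Kraft's inequality.

```python
def canonical_code(lengths):
    """Map each character (0-based index) to its canonical codeword string.

    lengths[ch] is the codeword length of character ch.
    """
    # Shorter codewords first; ties broken by the character's index.
    order = sorted(range(len(lengths)), key=lambda ch: (lengths[ch], ch))
    codewords = {}
    value = 0
    prev_len = lengths[order[0]]
    for ch in order:
        value <<= lengths[ch] - prev_len   # append 0s when the length grows
        prev_len = lengths[ch]
        codewords[ch] = format(value, "0{}b".format(prev_len))
        value += 1                         # add 1 to obtain the next codeword
    return codewords

code = canonical_code([2, 1, 3, 3])
```

For the lengths 2, 1, 3, 3 this assigns `10`, `0`, `110` and `111` to characters 0 to 3: the first codeword is all 0s, and each later one is its predecessor plus 1 with 0s appended.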

The above gives us a finer upper bound. For example, Katona and Nemetz [12]
showed that, if a character has probability p, then any Huffman code assigns it a
codeword of length at most about $\log(1/p) / \log \phi$, where $\phi \approx 1.618$ is the golden ratio,
and thus L is at most about $1.44 \log(1/p_{\min})$, where $p_{\min}$ is the smallest probability in
P. Alternatively, one can enforce a value for L and pay a price in terms of expected
codeword length. Milidiú and Laber [14] showed how, for any $L > \lceil \log n \rceil$, we can
build a prefix code with maximum codeword length at most L and expected codeword
length within $1/\phi^{L - \lceil \log(n + \lceil \log n \rceil - L) \rceil - 1}$ of the minimum. Their algorithm works by
building a Huffman tree $T_1$; removing all the subtrees rooted at depth greater than
L; building a complete binary tree $T_2$ of height h whose leaves are those removed
from $T_1$; finding the node v at depth $L - h - 1$ in $T_1$ whose subtree $T_3$'s leaves are
labelled by characters with minimum total probability (which they showed is at most
$1/\phi^{L - \lceil \log(n + \lceil \log n \rceil - L) \rceil - 1}$); and finally replacing v by a new node whose subtrees are $T_2$ and
$T_3$.
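
To get a feel for how quickly the stated bound shrinks as L grows, it can be evaluated numerically. The sketch below only evaluates the formula quoted above; the function names are ours, and the restriction $L < n$ is an assumption we add so that the inner logarithm is well defined.

```python
from math import sqrt

PHI = (1 + sqrt(5)) / 2  # golden ratio, about 1.618

def ceil_log2(x):
    """Ceiling of log2(x) for an integer x >= 1, computed exactly."""
    return (x - 1).bit_length()

def redundancy_bound(n, L):
    """Evaluate 1 / phi^(L - ceil(log(n + ceil(log n) - L)) - 1)."""
    if not ceil_log2(n) < L < n:
        raise ValueError("expected ceil(log n) < L < n")
    exponent = L - ceil_log2(n + ceil_log2(n) - L) - 1
    return 1 / PHI ** exponent

bounds = [redundancy_bound(16, L) for L in range(5, 12)]
```

For n = 16 the values in `bounds` decrease as L goes from 5 to 11, reflecting the trade-off between the maximum codeword length and the expected codeword length.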

A simple upper bound for storing a prefix code with expected codeword length
within a constant of the minimum follows from Gilbert and Moore's proof [9] that we
can build an alphabetic prefix code with expected codeword length less than $H(P) + 2$
and, thus, within 2 bits of the minimum. Moreover, in an optimal alphabetic prefix
code, the expected codeword length is within 1 of the minimum [17, 19] which, in
turn, is within 1 of the entropy $H(P)$. In an alphabetic prefix code, the lexicographic
order of the codewords is the same as that of the characters, so we need store only the
code-tree and not the assignment of codewords to characters. If we store the code-tree
in a succinct data structure due to Munro and Raman [16], then it takes $O(n)$
bits and encoding and decoding any character takes time at most proportional to its
codeword length. This can be improved to $O(1)$ by using table lookup, but doing so
may worsen the space bound unless we also restrict the maximum codeword length,
which may in turn increase the expected codeword length.

The code-tree of a canonical code can be stored in just $O(L^2)$ bits: By its definition,
we can reconstruct the whole canonical tree given only the first codeword of each 
length. Unfortunately, Gagie's lower bound [6] suggests we generally cannot combine 
results concerning canonical codes with those concerning alphabetic prefix codes. 

Constant-time encoding and decoding using canonical code-trees is simple. Notice 
that if two codewords have the same length, then the difference between their ranks 
in the code is the same as the difference between the codewords themselves, viewed 
as binary numbers. Suppose we build an $O(L^2)$-bit array A and a dictionary D
supporting predecessor queries, each storing the first codeword of each length. Given the
length of a character's codeword and its rank among codewords of the same length
(henceforth called its offset), we can find the actual codeword by retrieving the first
codeword of that length from A and then, viewing that first codeword as a binary
number, adding the offset minus 1. Given a binary string starting with a codeword,
we can find that codeword's length and offset by retrieving the string's predecessor in
D, which is the first codeword of the same length; truncating the string to the same
length in order to obtain the actual codeword; and subtracting the first codeword
from the actual codeword, viewing both as binary numbers, to obtain the offset
minus 1. (If D supports numeric predecessor queries instead of lexicographic predecessor
queries, then we store the first codewords with enough 0s appended to each that they
are all the same length, and store their original lengths as auxiliary information.)
Assuming it takes $O(1)$ time to compute the length and offset of any character's
codeword given that character's index in the alphabet, encoding any character takes
$O(1)$ time. Assuming it takes $O(1)$ time to compute any character's index in the
alphabet given its codeword's length and offset, decoding takes within a constant factor
of the time needed to perform a predecessor query on D. For simplicity, in this paper
we consider the number used to represent a character in the machine's memory to be
that character's index in the alphabet, so finding the index is the same as finding the 
character itself. 
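
The conversion between codewords and (length, offset) pairs can be sketched as follows. This is a simplified illustration rather than the data structure itself: the names are ours, the dictionary D is replaced by a sorted Python list searched with `bisect` (so the predecessor query takes logarithmic rather than constant time), and the first codewords are padded with 0s to length L as in the numeric-predecessor variant mentioned above.

```python
import bisect

def first_codewords(counts):
    """counts[l] is the number of codewords of length l (index 0 unused).

    Returns {l: first codeword of length l, as an integer} for the lengths in use.
    """
    first, value = {}, 0
    for ell in range(1, len(counts)):
        value <<= 1                        # moving one level down the code-tree
        if counts[ell]:
            first[ell] = value
            value += counts[ell]
    return first

def encode(first, length, offset):
    """Codeword with the given length and offset (offsets start at 1)."""
    return format(first[length] + offset - 1, "0{}b".format(length))

def decode(first, L, bits):
    """Return (length, offset) of the codeword at the start of the bit string."""
    lens = sorted(first)
    padded = [first[l] << (L - l) for l in lens]       # first codewords, 0-padded
    window = int(bits[:L].ljust(L, "0"), 2)
    length = lens[bisect.bisect_right(padded, window) - 1]   # predecessor query
    codeword = window >> (L - length)                  # truncate to that length
    return length, codeword - first[length] + 1

first = first_codewords([0, 1, 1, 2])      # one codeword of length 1, one of 2, two of 3
```

With these counts the first codewords are `0`, `10` and `110`, so `encode(first, 3, 2)` yields `111` and `decode(first, 3, "10110")` yields length 2 and offset 1.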

In a recent paper on adaptive prefix coding, Gagie and Nekrich [8] (see also [11])
pointed out that if $L = O(w)$, where w is the length of a machine word, then we can
implement D as an $O(w^2)$-bit data structure due to Fredman and Willard [4] such
that predecessor queries take $O(1)$ time. This seems a reasonable assumption since,
for any string of length m with $\log m = O(w)$, if P is the probability distribution
that assigns to each character probability proportional to its frequency in the string,
then the smallest positive probability in P is at least 1/m; therefore, the maximum
codeword length in either a Huffman code or a Shannon code for P is $O(w)$. Gagie and
Nekrich used $O(n \log n)$-bit arrays to compute the length and offset of any character's
codeword given that character's index in the alphabet, and vice versa, and thus
achieved $O(1)$ time for both encoding and decoding.

A technique we will use to obtain our result is the wavelet tree of Grossi et al. [10],
and more precisely the multiary variant due to Ferragina et al. [2]. The latter
represents a sequence $S[1, n]$ over an alphabet $\Sigma$ of size $\sigma$ such that the following
operations can be carried out in $O\left(\left\lceil \frac{\log \sigma}{\log\log n} \right\rceil\right)$ time on the RAM model with a computer
word of length $\Omega(\log n)$: (1) given i, retrieve $S[i]$; (2) given i and $c \in \Sigma$, compute
$\mathrm{rank}_c(S, i)$, the number of occurrences of c in $S[1, i]$; (3) given j and $c \in \Sigma$, compute
$\mathrm{select}_c(S, j)$, the position in S of the j-th occurrence of c. The wavelet tree requires
$nH_0(S) + O\left(\frac{n \log\log n}{\log_\sigma n}\right)$ bits of space, where $H_0(S) \leq \log \sigma$ is the empirical zero-order
entropy of S, defined as $H_0(S) = H(\{n_c/n\}_{c \in \Sigma})$, where $n_c$ is the number of
occurrences of c in S. Thus $nH_0(S)$ is a lower bound to the output size of any zero-order
compressor applied to S. It will be useful to write $H_0(S) = \sum_{c \in \Sigma} \frac{n_c}{n} \log \frac{n}{n_c}$.
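
To fix the semantics of the three operations and of $H_0$, here is a deliberately naive Python sketch. It scans the sequence in linear time and is in no way a wavelet tree; the function names are ours and positions are 1-based as in the text.

```python
from collections import Counter
from math import log2

def access(S, i):
    """S[i], with positions numbered from 1."""
    return S[i - 1]

def rank(S, c, i):
    """Number of occurrences of c in S[1, i]."""
    return S[:i].count(c)

def select(S, c, j):
    """Position in S of the j-th occurrence of c."""
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences of c")

def H0(S):
    """Empirical zero-order entropy: sum over c of (n_c / n) log(n / n_c)."""
    n = len(S)
    return sum(nc / n * log2(n / nc) for nc in Counter(S).values())

S = [2, 1, 3, 3, 2]
```

On this sequence, `rank(S, 3, 4)` is 2 and `select(S, 2, 2)` is 5. A sequence with two equally frequent symbols, such as `[1, 1, 2, 2]`, has zero-order entropy exactly 1 bit per symbol.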



3 Additive increase in expected codeword length 

In this section we exchange a small additive penalty over the optimal prefix code 
for a space-efficient representation of the encoding, which in addition enables en- 
code/decode operations in constant time. 

It follows from Milidiú and Laber's bound [14] that, for any $\epsilon > 0$, there is always
a prefix code with maximum codeword length $L = \lceil \log n \rceil + \lceil \log(2/\epsilon) \rceil$ and expected
codeword length within

$$\frac{1}{\phi^{L - \lceil \log(n + \lceil \log n \rceil - L) \rceil - 1}} \leq \frac{1}{\phi^{L - \lceil \log n \rceil - 1}} \leq \frac{1}{\phi^{\log(2/\epsilon) - 1}} < \epsilon$$

of the minimum. The techniques described in the previous section give a way to store
such a code in $O(L^2 + n \log L)$ bits, yet it is not immediate how to do constant-time
encoding and decoding. Alternatively, we can achieve constant-time encoding and
decoding using $O(w^2 + n \log n)$ bits for the code-tree.

To achieve constant encoding and decoding times without ruining the space, we
use multiary wavelet trees. We use a canonical code, and sort the characters (i.e.,
leaves) alphabetically within each depth, as described in the previous section. Let
$S[1, n]$ be the sequence of depths in the canonical code-tree, so that $S[c]$ ($1 \leq c \leq n$) is
the depth of the character c. Now, the depth and offset of any $c \in \Sigma$ are easily computed
from the wavelet tree of S: the depth is just $S[c]$, while the offset is $\mathrm{rank}_{S[c]}(S, c)$.
Inversely, given a depth d and an offset o, the corresponding character is $\mathrm{select}_d(S, o)$.
The $O(w^2)$-bit data structure of Gagie and Nekrich [8] converts in constant time
pairs (depth, offset) into codes and vice versa, whereas the multiary wavelet tree on S
requires $n \log L + O\left(\frac{n \log\log n}{\log_L n}\right)$ bits of space and completes encoding/decoding in time
$O\left(\left\lceil \frac{\log L}{\log\log n} \right\rceil\right)$. Under the restriction $1/\epsilon = O(\mathrm{polylog}(n))$, the space is $O(w^2) + n \log L + o(n)$
and the time is $O(1)$.
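
The mapping between characters and (depth, offset) pairs can be illustrated with a few lines of Python. This sketch substitutes plain list scans for the wavelet tree, so it shows only what is computed, not how it is computed in constant time; the function names and the sample sequence are ours.

```python
def depth_and_offset(S, c):
    """Depth S[c] and offset rank_{S[c]}(S, c) of character c (1-based)."""
    d = S[c - 1]                     # access: S[c]
    return d, S[:c].count(d)         # occurrences of d in S[1, c]

def character(S, d, o):
    """select_d(S, o): the character with depth d and offset o."""
    positions = [pos for pos, x in enumerate(S, start=1) if x == d]
    return positions[o - 1]

# Depths of characters 1..6 in some canonical code (a hypothetical example).
S = [4, 1, 4, 2, 4, 4]
pair = depth_and_offset(S, 5)
```

Character 5 is the third character of depth 4, so `pair` is `(4, 3)`, and `character(S, 4, 3)` maps that pair back to 5.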

Going further, we note that $H_0(S)$ is at most $\log(L - \lceil \log n \rceil + 1) + O(1)$, so
we can store S in $O(n \log(L - \lceil \log n \rceil + 1) + n) = O(n \log\log(1/\epsilon))$ bits. To see this,
consider S as two interleaved subsequences, $S_1$ and $S_2$, of length $n_1$ and $n_2$, with $S_1$
containing those lengths less than or equal to $\lceil \log n \rceil$ and $S_2$ containing those greater.
Thus $nH_0(S) \leq n_1 H_0(S_1) + n_2 H_0(S_2) + n$.

Since there are at most $2^\ell$ codewords of length $\ell$, assume we complete $S_1$ with
spurious symbols so that it has exactly $2^\ell$ occurrences of symbol $\ell$. This completion
cannot decrease $n_1 H_0(S_1) = \sum_{1 \leq c \leq \lceil \log n \rceil} n_c \log \frac{n_1}{n_c}$, as increasing some $n_c$ to $n_c + 1$
produces a difference of $f(n_1) - f(n_c) \geq 0$, where $f(x) = (x + 1) \log(x + 1) - x \log x$ is
increasing. Hence we can reason as if $S_1$ contained exactly $2^\ell$ occurrences of each symbol
$1 \leq \ell \leq \lceil \log n \rceil$, where straightforward calculation shows that $n_1 H_0(S_1) = O(n_1)$.
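
One way to carry out the calculation for the completed sequence is sketched below. The shorthand $K = \lceil \log n \rceil$ is introduced here only for readability, and $n_1$ denotes the length of $S_1$ after completion.

```latex
\begin{align*}
n_1 &= \sum_{\ell=1}^{K} 2^{\ell} = 2^{K+1} - 2, \\
n_1 H_0(S_1) &= \sum_{\ell=1}^{K} 2^{\ell} \log\frac{n_1}{2^{\ell}}
  \le \sum_{\ell=1}^{K} 2^{\ell}\,(K + 1 - \ell)
  = 2^{K+1} \sum_{j=1}^{K} \frac{j}{2^{j}}
  = 2^{K+2} - 2K - 4
  \le 2 n_1 .
\end{align*}
```

The first inequality uses $n_1 < 2^{K+1}$, and the sum over j is evaluated with $\sum_{j=1}^{K} j/2^j = 2 - (K+2)/2^K$. Since $2^{K+1} < 4n$, the completed sequence accounts for only $O(n)$ bits.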

On the other hand, $S_2$ contains at most $L - \lceil \log n \rceil$ distinct values, so $H_0(S_2) \leq \log(L - \lceil \log n \rceil)$,
unless $L = \lceil \log n \rceil$, in which case $S_2$ is empty and $n_2 H_0(S_2) = 0$.
Thus $n_2 H_0(S_2) \leq n_2 \log \lceil \log(2/\epsilon) \rceil = O(n_2 \log\log(1/\epsilon))$.

Combining both bounds, we get $H_0(S) = O(1 + \log\log(1/\epsilon))$. If we assume the
text to encode is of length $m = 2^{O(w)}$, as usual under the RAM model of computation,
then $L = O(w)$ and the following theorem holds.

Theorem 1. For any $0 < \epsilon < 1/2$ with $1/\epsilon = O(\mathrm{polylog}(n))$, and under the RAM
model with computer word size w, so that the text to encode is of length $2^{O(w)}$, we
can store a prefix code with expected codeword length within an additive term $\epsilon$ of
the minimum, using $O(w^2 + n \log\log(1/\epsilon))$ bits, such that encoding and decoding any
character takes $O(1)$ time.

In other words, under mild assumptions, we can store a code using $O(n \log\log(1/\epsilon))$
bits at the price of increasing the average codeword length by $\epsilon$, and in addition have
constant-time encoding and decoding. For constant $\epsilon$, this means that the code uses
just $O(n)$ bits at the price of an arbitrarily small constant additive penalty over the
shortest possible prefix code.

4 Multiplicative increase in expected codeword length 

In this section we focus on a multiplicative rather than additive penalty over the op- 
timal prefix code, in order to achieve a sublinear-sized representation of the encoding, 
which still enables constant-time encoding and decoding. 

Our main idea is to divide the alphabet into probable and improbable characters
and to store information about only the probable ones. Given a constant $c > 1$,
we use Milidiú and Laber's algorithm [14] to build a prefix code with maximum
codeword length $L = \lceil \log n \rceil + \lceil 1/(c-1) \rceil + 1$. We call a character's codeword
short if it has length at most $L/c + 2$, and long otherwise. Notice there are at most
$2^{L/c+3} - 1 = O(n^{1/c})$ characters with short codewords. Also, although applying Milidiú
and Laber's algorithm may cause some exceptions, characters with short codewords
are usually more probable than characters with long codewords. We will hereafter
call infrequent characters those encoded with long codewords in the code of Milidiú
and Laber.

Paper Submitted to PSC

We transform this length-restricted prefix code as described in Section 2, namely,
we sort the characters lexicographically within each depth. We use a dictionary data
structure F due to Fredman, Komlós and Szemerédi [3] to store the indices of the
characters with short codewords. This data structure takes $O(n^{1/c} \log n)$ bits and
supports membership queries in $O(1)$ time, with successful queries returning the
target character's codeword. We also build $\lfloor L/c \rfloor + 2$ arrays that together store the
indices of all the characters with short codewords; for $1 \leq \ell \leq \lfloor L/c \rfloor + 2$, the $\ell$th
array stores the indices of the characters with codewords of length $\ell$, in lexicographic
order by codeword. Again, we store the first codeword of each length in $O(w^2)$ bits
overall, following Gagie and Nekrich [8], such that it takes $O(1)$ time to compute any
codeword given its length and offset, and vice versa. With these data structures, we
can encode and decode any character with a short codeword in $O(1)$ time. To encode,
we perform a membership query on the dictionary to check whether the character has
a short codeword; if it does, we receive the codeword itself as satellite information
returned by the query. To decode, we first find the codeword's length $\ell$ and offset j
in $O(1)$ time as described in Section 2. Since the codeword is short, $\ell \leq \lfloor L/c \rfloor + 2$
and the character's index is stored in the jth cell of the $\ell$th array.

We replace each long codeword with new codewords: instead of a long codeword $\alpha$
of length $\ell$, we insert $2^{L+1-\ell}$ new codewords $\alpha \cdot s$, where $\cdot$ denotes concatenation and
s is an arbitrary binary string of length $L + 1 - \ell$. Figure 1 shows an example. Since
$c > 1$, we have $n^{1/c} < n/2$ for sufficiently large n, so we can assume without loss of
generality that there are fewer than n/2 short codewords; hence, the number of long
codewords is at least n/2. Since every long codeword is replaced by at least two new
codewords, the total number of new codewords is at least n. Since new codewords
are obtained by extending all codewords of length $\ell > L/c + 2$ in a canonical code,
all new codewords are binary representations of consecutive integers. Therefore the
i-th new codeword equals $\alpha_f + i - 1$, where $\alpha_f$ is the first new codeword. If a is
an infrequent character, we encode it with the a-th new codeword, $\alpha_f + a - 1$. To
encode a character a, we check whether a belongs to the dictionary F. If $a \in F$, then
we output the codeword for a. Otherwise we encode a as $\alpha_f + a - 1$. To decode a
codeword $\alpha$, we read its prefix bitstring $s_\alpha$ of length $L + 1$ and compare $s_\alpha$ with $\alpha_f$.
If $s_\alpha \geq \alpha_f$, then $\alpha = s_\alpha$ is the codeword for $s_\alpha - \alpha_f + 1$. Otherwise, the codeword
length of the next codeword $\alpha$ is at most $L/c + 2$ and $\alpha$ can be decoded as described
in the previous paragraph.
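
The encoding and decoding rules just described can be sketched in Python as follows. Several simplifications are assumptions of this sketch and not part of the construction: the length-restricted code is given directly as a list of codeword lengths instead of being produced by Milidiú and Laber's algorithm, the threshold between short and long is passed in as a parameter t, at least one codeword is long, the dictionary F is a plain Python dict, and short codewords are decoded by trying each short length in turn rather than in constant time. All function names are ours.

```python
def build_code(lengths, t):
    """lengths[a-1] is the codeword length of character a (1-based);
    codewords of length at most t are short, the others are long."""
    L = max(lengths)
    order = sorted(range(1, len(lengths) + 1), key=lambda a: (lengths[a - 1], a))
    code, value, prev_len = {}, 0, lengths[order[0] - 1]
    for a in order:                        # canonical assignment
        ell = lengths[a - 1]
        value <<= ell - prev_len
        code[a] = (value, ell)
        value += 1
        prev_len = ell
    short = {a: cw for a, cw in code.items() if cw[1] <= t}   # plays the role of F
    first_long = next(a for a in order if lengths[a - 1] > t)
    first_val, first_len = code[first_long]
    alpha_f = first_val << (L + 1 - first_len)   # first new codeword, length L + 1
    return short, alpha_f, L

def encode(a, short, alpha_f, L):
    if a in short:                         # membership query on F
        value, ell = short[a]
    else:                                  # infrequent: the a-th new codeword
        value, ell = alpha_f + a - 1, L + 1
        assert value < 1 << (L + 1), "not enough new codewords"
    return format(value, "0{}b".format(ell))

def decode(bits, short, alpha_f, L):
    """Decode the first character of bits; return (character, bits consumed)."""
    window = int(bits[:L + 1].ljust(L + 1, "0"), 2)
    if window >= alpha_f:
        return window - alpha_f + 1, L + 1
    by_codeword = {format(v, "0{}b".format(ell)): a for a, (v, ell) in short.items()}
    for ell in range(1, L + 1):
        if bits[:ell] in by_codeword:
            return by_codeword[bits[:ell]], ell
    raise ValueError("no codeword found")

short, alpha_f, L = build_code([4, 1, 4, 2, 4, 4], t=2)
```

In this toy instance characters 2 and 4 keep their short codewords `0` and `10`, while the four long codewords of length 4 are extended to the eight consecutive 5-bit values starting at $\alpha_f$ = `11000`, so character 6 is encoded as `11101`.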

By analysis of the algorithm by Milidiú and Laber [14] we can see that the codeword
length of a character in their length-restricted code exceeds the codeword length
of the same character in an optimal code by at most 1, and only when the codeword
length in the optimal code is at least $L - \lceil \log n \rceil - 1 = \lceil 1/(c-1) \rceil$. Hence, the codeword
length of a character encoded with a short codeword exceeds the codeword length of
the same character in an optimal code by a factor of at most $\frac{\lceil 1/(c-1) \rceil + 1}{\lceil 1/(c-1) \rceil} \leq c$. Every
infrequent character is encoded with a codeword of length $L + 1$. Since the codeword
length of an infrequent character in the length-restricted code is more than $L/c + 2$,
its length in an optimal code is more than $L/c + 1$. Hence, the codeword length of
a long character in our code is at most $\frac{L+1}{L/c+1} < c$ times greater than the codeword
length of the same character in an optimal code. Hence, the average codeword length
for our code is less than c times the optimal one.

Figure 1. An example with n = 16 and c = 3. The black tree is the result of applying the algorithm of
Milidiú and Laber on the original prefix code. Now, we set L = 6 according to our formula, and declare
short the codeword lengths up to $\lfloor L/c \rfloor + 2 = 4$. Short codewords are stored unaltered in a dictionary (blue).
Longer codewords are changed: All are extended up to length L + 1 = 7 and reassigned a code according to
their values in the contiguous slots of length 7 (in red).



Theorem 2. For any constant $c > 1$, under the RAM model with computer word
size w, so that the text to encode is of length $2^{O(w)}$, we can store a prefix code with
expected codeword length within c times the minimum in $O(w^2 + n^{1/c} \log n)$ bits, such
that encoding and decoding any character takes $O(1)$ time.

Again, under mild assumptions, this means that we can store a code with expected
length within c times the optimum, in $O(n^{1/c} \log n)$ bits and allowing constant-time
encoding and decoding.



References 

[1] M. Adler and B. M. Maggs: Protocols for asymmetric communication channels. Journal of Computer
and System Sciences, 63(4) 2001, pp. 573-596.
[2] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro: Compressed representations of sequences
and full-text indexes. ACM Transactions on Algorithms, 3(2) 2007, article 20.
[3] M. L. Fredman, J. Komlós, and E. Szemerédi: Storing a sparse table with O(1) worst case access
time. Journal of the ACM, 31(3) 1984, pp. 538-544.
[4] M. L. Fredman and D. E. Willard: Surpassing the information theoretic bound with fusion trees.
Journal of Computer and System Sciences, 47(3) 1993, pp. 424-436.
[5] T. Gagie: Compressing probability distributions. Information Processing Letters, 97(4) 2006,
pp. 133-137.
[6] T. Gagie: Large alphabets and incompressibility. Information Processing Letters, 99(6) 2006,
pp. 246-251.
[7] T. Gagie: Dynamic asymmetric communication. Information Processing Letters, 108(6) 2008,
pp. 352-355.
[8] T. Gagie and Y. Nekrich: Worst-case optimal adaptive prefix coding, in Proceedings of the Algorithms
and Data Structures Symposium (WADS), 2009, to appear.
[9] E. N. Gilbert and E. F. Moore: Variable-length binary encodings. Bell System Technical Journal,
38 1959, pp. 933-967.
[10] R. Grossi, A. Gupta, and J. Vitter: High-order entropy-compressed text indexes, in Proc. 14th
Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003, pp. 841-850.
[11] M. Karpinski and Y. Nekrich: A fast algorithm for adaptive prefix coding. Algorithmica, to appear.
[12] G. O. H. Katona and T. O. H. Nemetz: Huffman codes and self-information. IEEE Transactions
on Information Theory, 22(3) 1976, pp. 337-340.
[13] S. T. Klein: Skeleton trees for the efficient decoding of Huffman encoded texts. Information Retrieval,
3(4) 2000, pp. 315-328.
[14] R. L. Milidiú and E. S. Laber: Bounding the inefficiency of length-restricted prefix codes.
Algorithmica, 31(4) 2001, pp. 513-529.
[15] A. Moffat and A. Turpin: On the implementation of minimum-redundancy prefix codes. IEEE
Transactions on Communications, 45(10) 1997, pp. 1200-1207.
[16] J. I. Munro and V. Raman: Succinct representation of balanced parentheses and static trees. SIAM
Journal on Computing, 31(3) 2001, pp. 762-776.
[17] N. Nakatsu: Bounds on the redundancy of binary alphabetical codes. IEEE Transactions on Information
Theory, 37(4) 1991, pp. 1225-1229.
[18] E. S. Schwartz and B. Kallick: Generating a canonical prefix encoding. Communications of the
ACM, 7(3) 1964, pp. 166-169.
[19] D. Sheinwald: On binary alphabetic codes, in Proceedings of the Data Compression Conference, 1992,
pp. 112-121.






